

HOW TO MEASURE IMPROVEMENT OF A SIMULATION MODEL
   ALONG A DIMENSION OF LINGUISTIC COMPREHENSION

COLBY, HILF, WITTNER, PARKISON, FAUGHT


	To measure improvement one needs a scaled dimension and a
value on that dimension to be striven for. In a previous
communication (Colby and Hilf, 1974) a method was described for
using judges to rate a paranoid simulation model's performance
along a variety of dimensions. The judges were randomly selected
psychiatrists who rated transcripts of interviews conducted in
natural language by other psychiatrists with paranoid patients
and with versions of the model (PARRY1). The interviewers and the
raters did not know that one of the interviewees was a computer
simulation of paranoid processes.
	One of the rated dimensions was linguistic noncomprehension.
(The negation "non" was used to keep the ratings consistent with
other ratings being made at the same time.) A judge rated each
I-O pair of an interview along this dimension on a scale of 0-9.
The judges proved to be reliable [Frank- concordance scores here
on this dimension]. The mean score received by the patients was
0.74 and by the model 2.22. The difference between the two mean
ratings is significant at better than the 0.001 level.
	Close study of the reasons for this difference revealed that
the model recognized topics in the natural language input but did
not sufficiently recognize exactly what was being said about a
topic. The pattern-recognition processes of the model failed to
pick up sufficient information about a topic to give a reply
indicating comprehension. The power of a pattern-matching
approach in language recognition is the ability to ignore as
irrelevant both what it recognizes to be irrelevant and what it
does not recognize at all. Its weakness lies in not having enough
patterns to match the tremendous variety of expressions found in
natural language dialogues.
	To improve the language-recognition processes of the model
we designed several additional techniques which we shall only outline
here. A complete description of them can be found in Colby, Parkison
and Faught (1974).
	In brief, the language-recognizing module of the current
paranoid model (PARRY2) progressively transforms the input until
a pattern is achieved which completely or fuzzily matches a more
abstract stored pattern. (See the flow diagram of Fig. 1.) The
input expression is first preprocessed by translating words and
word groups (such as idioms) into internal synonyms which
represent our names for word classes. Words not in the
recognizer's dictionary are not included in the pattern being
formed. Misspellings are corrected, groups of words are
contracted into single words, and certain expansions are made
(e.g. "dont" becomes "do not"). The pattern is then bracketed
into shorter, more manageable units termed "segments". The
resultant pattern is classified as "simple", containing no
delimiters, or "complex", consisting of two or more simple
patterns.
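
	As an illustration of this stage, a minimal sketch in a
modern notation (Python) is given below. The tables, word classes,
and function names are invented for the example and are not the
model's actual code; idiom contraction is omitted for brevity.

# A sketch of the preprocessing stage; all tables are invented.
SPELLING   = {"waht": "what", "afriad": "afraid"}   # misspelling fixes
EXPANSIONS = {"dont": "do not", "wont": "will not"} # certain expansions
SYNONYMS   = {                                      # word -> word-class name
    "cops": "POLICE", "police": "POLICE",
    "afraid": "FEAR", "scared": "FEAR",
    "why": "WHY", "are": "BE", "you": "YOU",
    "of": "OF", "do": "DO", "not": "NOT", "and": "AND",
}
DELIMITERS = {"AND", "BUT"}                         # segment boundaries

def preprocess(text):
    """Translate an input expression into a pattern of word-class
    names; words absent from the dictionary are not included."""
    words = []
    for w in text.lower().split():
        w = SPELLING.get(w, w)                      # correct misspellings
        words.extend(EXPANSIONS.get(w, w).split())  # expand, e.g. "dont"
    return [SYNONYMS[w] for w in words if w in SYNONYMS]

def segment(pattern):
    """Bracket the pattern into segments at delimiters. One segment
    means a "simple" pattern; two or more make a "complex" one."""
    segments, current = [], []
    for token in pattern + ["AND"]:                 # sentinel flushes last
        if token in DELIMITERS:
            if current:
                segments.append(current)
                current = []
        else:
            current.append(token)
    return segments

	On this toy dictionary, "Why are you afriad of the cops"
becomes the single segment WHY BE YOU FEAR OF POLICE: the
misspelling is corrected, "the" is dropped as absent from the
dictionary, and the resulting pattern is simple.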
	The algorithm then attempts a complete match of the segments
with stored simple patterns. When a match is found, the stored
pattern points to the name of a response function in "memory"
which decides what to do next. If a match is not found, a fuzzy
match is tried by dropping elements in a segment one at a time
and trying for a match each time. In the case of complex patterns
this one-at-a-time dropping is carried out at the segment level.
If these methods do not produce a match, a default condition
obtains and the response module decides what to do.
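
	The matching step can be sketched in the same illustrative
style. The stored patterns and response-function names below are
invented, and the handling of complex patterns is one plausible
rendering of the segment-level dropping described above, not a
description of the model's actual data layout.

# A sketch of complete and fuzzy matching; all tables are invented.
STORED_SIMPLE = {                # simple pattern -> its name
    ("WHY", "BE", "YOU", "FEAR", "OF", "POLICE"): "FEAR-OF-POLICE",
    ("YOU", "BE", "CRAZY"): "INSULT",
}
STORED_COMPLEX = {               # tuple of simple-pattern names
    ("FEAR-OF-POLICE", "INSULT"): "angry_reply",
}
RESPONSE_FN = {                  # pattern name -> response function
    "FEAR-OF-POLICE": "discuss_police",
    "INSULT": "handle_insult",
}

def fuzzy_match(items, stored):
    """Try a complete match first; failing that, drop elements one
    at a time, trying for a match after each drop."""
    if tuple(items) in stored:
        return stored[tuple(items)]
    for i in range(len(items)):
        trial = tuple(items[:i] + items[i + 1:])
        if trial in stored:
            return stored[trial]
    return None

def recognize(segments):
    """Return the name of a response function, or a default."""
    if len(segments) == 1:       # simple pattern
        name = fuzzy_match(segments[0], STORED_SIMPLE)
        return RESPONSE_FN.get(name, "default_response")
    # Complex pattern: match each segment, then apply the same
    # one-at-a-time dropping at the segment level.
    names = [fuzzy_match(s, STORED_SIMPLE) for s in segments]
    fn = fuzzy_match([n for n in names if n], STORED_COMPLEX)
    return fn or "default_response"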
	For this language-recognition strategy to be successful, a
large number of words and word-combinations must be recognized
and converted into patterns which match stored patterns. In the
first experiment to be described, there were 1900 dictionary
entries and about 2200 patterns, 1700 being simple and 500
complex.

		EXPERIMENT 1

		METHOD

	Five clinicians interviewed both the old (PARRY1) and
new (PARRY2) versions of the model without knowing which was which.
All five agreed PARRY2 showed greater linguistic comprehension.
To obtain a more precise estimate, 19 graduate students were
paid to rate transcripts of these interviews. They rated each
I-O pair of each interview along a dimension of "linguistic
comprehension" ("Did the patient understand what the doctor
said?") on a 0-9 scale.
		RESULTS

	In the 10 interviews there was a total of %%%% I-O pairs.
On a 0-9 scale of linguistic comprehension, the mean rating of
PARRY1 was 5.256 and the mean rating of PARRY2 was 5.483. This
difference is significant at the 0.05 level (t=1.0935, one-tailed
test).
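
	For reference, this kind of comparison of mean I-O pair
ratings can be sketched as below. The rating vectors are random
placeholders rather than the experimental data, and scipy's
independent two-sample t-test is one standard choice; it is not
necessarily the exact test used here.

# A sketch of the one-tailed comparison of mean ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
parry1 = rng.integers(0, 10, size=120)   # placeholder 0-9 ratings
parry2 = rng.integers(0, 10, size=120)   # placeholder 0-9 ratings

# One-tailed test: is PARRY2's mean rating greater than PARRY1's?
t, p = stats.ttest_ind(parry2, parry1, alternative="greater")
print(f"means: {parry1.mean():.3f} vs {parry2.mean():.3f}, "
      f"t = {t:.4f}, p = {p:.3f}")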
	These raters also rated transcripts of the original
eight interviews conducted by psychiatrists with PARRY1 and
with paranoid patients. PARRY1 received a mean rating of 5.19 and
the patients 7.42. The difference is significant at the 0.001
level. This confirms the original test using psychiatrists
as raters. (Frank---how does it?)
	The student raters gave PARRY1 a mean rating of 5.19 in the
original interviews and a mean rating of 5.26 in the experiment
under discussion. The difference is not statistically significant
(SD(difference)=0.1497, t=0.45, p<0.80). We can conclude that the
student raters are reliable and that PARRY1 elicits consistent
ratings from two groups of raters.

		DISCUSSION


	The improvement of PARRY2 over PARRY1 along the dimension of
linguistic comprehension (movement towards the ratings received
by the patients) is statistically significant. However, PARRY2's
rating of 5.48 is still distant from the rating of 7.42 received
by the patients. How close should a simulation model come to its
natural counterpart? Everybody knows that nobody knows. Perhaps
we have reached the limit of approximation. Intuitively it seemed
the model should be able to do better if we could pinpoint its
most serious inadequacies.
	We looked at each I-O pair which received a mean rating
of 5.0 or less. There were %%%% such cases. In %%% of these cases
the pattern was recognized but, due to our own errors, the
pointers pointed to the wrong response functions. In the %%%
remaining cases, the pattern was not recognized. We corrected the
pointers and then repeated the experiment using five different
clinicians who interviewed PARRY1 and PARRY2.

		EXPERIMENT 2